After munging and merging the data, we have 15 variables and 175 observations: one per month from 2005 to 2019.
Missing data can have a large impact on modelling, so we check the dataset again for missing values using a missingness map, with variables on the y-axis and observations on the x-axis. The plot below contains no red cells, so there is no missing data in this dataset.
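A minimal sketch of how such a missingness map can be drawn with the Amelia package, assuming the merged data frame is named `combi` (a hypothetical name):

```r
# Missingness map: missing cells are drawn in a contrasting colour,
# so an all-clear plot means no missing values. `combi` is assumed.
library(Amelia)
missmap(combi, main = "Missing values map")
```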
## Observations: 174
## Variables: 16
## $ asx <dbl> 4107, 4173, 4110, 3983, 4106, 4278, 438…
## $ oecd_li <dbl> 100.21160, 100.14570, 100.06870, 100.01…
## $ abs_imports <dbl> 11154780, 11123461, 12699350, 12569908,…
## $ abs_exports <dbl> 9232638, 9503409, 10451659, 11566650, 1…
## $ gold_price_london_fixing <dbl> 438.00, 423.80, 436.55, 427.50, 433.20,…
## $ unemployment <dbl> 5.056507, 5.660385, 5.796897, 5.503269,…
## $ rba_cash_rate <dbl> 5.25, 5.25, 5.50, 5.50, 5.50, 5.50, 5.5…
## $ yearly_inflation <dbl> 2.5, 2.5, 2.5, 2.4, 2.4, 2.4, 2.5, 2.5,…
## $ quarterly_inflation <dbl> 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.6, 0.6,…
## $ exchange_rate <dbl> 0.7744, 0.7905, 0.7719, 0.7811, 0.7557,…
## $ djia <dbl> 10490, 10766, 10504, 10193, 10467, 1027…
## $ pe_ratio <dbl> 17.0, 15.1, 14.9, 14.3, 14.3, 15.2, 15.…
## $ dividend <dbl> 3.6, 3.8, 3.9, 4.0, 4.0, 3.8, 3.7, 3.4,…
## $ iron <dbl> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,…
## $ oil <dbl> 47, 48, 54, 53, 50, 56, 59, 65, 66, 62,…
## $ Date <date> 2005-01-01, 2005-02-01, 2005-03-01, 20…
After removing rows with missing values, one observation is dropped, leaving 15 variables and 174 observations.
As the goal is to predict the up/down movement of the asx variable, we add a column called ‘direction’, with value 1 for up and 0 for down. Throughout this report we use direction as the response variable: it indicates whether the asx went up or down relative to the previous month.
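One way to derive this response with dplyr, again assuming the data frame is named `combi` (hypothetical):

```r
# direction = 1 if asx rose since the previous month, else 0.
# lag(asx) shifts the series down one row, so the comparison is
# month-over-month; the first row has no predecessor.
library(dplyr)
combi <- combi %>%
  mutate(direction = factor(ifelse(asx > lag(asx), 1, 0)))
```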
## # A tibble: 6 x 2
## asx direction
## <dbl> <fct>
## 1 4107 0
## 2 4173 1
## 3 4110 0
## 4 3983 0
## 5 4106 1
## 6 4278 1
Histograms give a first indication of the distribution of each variable.
Dividend, direction and unemployment are skewed to one side, while asx and abs_imports have roughly normal distributions; most of the other variables look approximately bimodal. From this plot, unemployment and abs_imports have distributions broadly similar to asx, and the up direction occurs noticeably more often than down.
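One way such a histogram panel can be produced with the psych package (already loaded in this report); `combi` is the hypothetical data frame name:

```r
# Draw a histogram for every numeric column in one panel.
library(psych)
multi.hist(combi[, sapply(combi, is.numeric)])
```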
The distributions can also be examined with box plots. Here, oecd_li, pe_ratio and dividend have a few outliers, while rba_cash_rate and iron are fairly widely distributed.
How best to handle these outliers is still an open question; for now we keep them in the dataset.
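A quick sketch of such a box-plot panel, assuming the data frame is `combi`; standardising the columns first puts the very differently scaled variables on one axis:

```r
# Boxplots of the standardised numeric columns on a single panel.
boxplot(scale(combi[, sapply(combi, is.numeric)]),
        las = 2, cex.axis = 0.7)
```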
Now, we plot the correlation between each pair of numeric variables to give an idea of which variables change together.
In the correlation matrix, blue represents positive correlation, red negative, and larger dots stronger correlation. In this report we focus on the correlations between asx and the other variables: asx appears positively correlated with djia, pe_ratio, oecd_li, abs_imports and abs_exports, and negatively correlated with dividend, yearly_inflation, iron and exchange_rate.
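A minimal sketch of this plot with the corrplot package, again assuming the data frame is named `combi`:

```r
# Pairwise correlations of the numeric columns, drawn as sized dots.
library(corrplot)
M <- cor(combi[, sapply(combi, is.numeric)], use = "complete.obs")
corrplot(M, method = "circle")
```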
Given these highly correlated variables, we put them into a scatter-plot matrix with the up/down direction as an indicator: blue for up, red for down.
The djia and pe_ratio give some signal of positive correlation, but it is not strong.
We now take a further step and view the distribution of each variable broken down by direction. The plot below shows that the up and down distributions are quite similar across all variables. There are slight differences for iron, oil, quarterly_inflation and abs_exports, but it is still hard to separate the classes. In general, predicting up versus down from only one or two variables is difficult.
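One way to produce such per-class density plots is `caret::featurePlot`; this sketch assumes the data frame `combi` with the factor column `direction` as used in this report:

```r
# Density of each numeric predictor, split by up/down direction.
library(caret)
num_vars <- combi[, sapply(combi, is.numeric)]
featurePlot(x = num_vars, y = combi$direction, plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")))
```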
Next, we explore the trend of each variable alongside asx to observe differences in trend. We log-scale the variables so they can be placed on the same plot without losing their individual trends.
Even on a log scale it is hard to discern a common trend across so many variables; however, no variable appears to share the same trend as asx.
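A hedged sketch of this trend plot with ggplot2, assuming the data frame `combi` with the `Date` column shown earlier:

```r
# Reshape to long format, then plot every numeric series against Date
# on a log10 y-axis so the trends fit on one panel.
library(dplyr)
library(tidyr)
library(ggplot2)
combi %>%
  select(Date, where(is.numeric)) %>%
  pivot_longer(-Date, names_to = "variable", values_to = "value") %>%
  ggplot(aes(Date, value, colour = variable)) +
  geom_line() +
  scale_y_log10()
```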
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn, svm, rf, glm
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.5000000 0.5555556 0.6176471 0.6218954 0.6617647 0.7647059 0
## cart 0.3333333 0.4656863 0.5718954 0.5356209 0.6053922 0.7058824 0
## knn 0.4444444 0.5073529 0.5718954 0.5882353 0.6764706 0.7647059 0
## svm 0.4444444 0.5637255 0.5882353 0.5934641 0.6617647 0.7058824 0
## rf 0.4444444 0.5367647 0.6568627 0.6349673 0.7058824 0.8235294 0
## glm 0.5000000 0.5637255 0.6111111 0.6218954 0.6911765 0.7647059 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## lda -0.02531646 0.06447368 0.18547858 0.203018860 0.3034515 0.5142857
## cart -0.42748092 -0.15378289 0.03751351 -0.009633965 0.1571974 0.3511450
## knn -0.12500000 -0.02271869 0.09200922 0.127764886 0.3053168 0.4687500
## svm -0.21621622 0.01200000 0.07211732 0.095555161 0.2678865 0.3511450
## rf -0.15384615 0.02074045 0.30644599 0.236478762 0.3724578 0.6277372
## glm -0.02531646 0.08143173 0.21237693 0.206927120 0.3230603 0.4687500
## NA's
## lda 0
## cart 0
## knn 0
## svm 0
## rf 0
## glm 0
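A sketch of how such a comparison can be built with caret: each algorithm is trained under the same 10-fold resampling scheme and the resamples are then summarised together. The training-set name `combi_train` follows the glm call shown later; only three of the six models are spelled out here.

```r
# Compare several classifiers under identical 10-fold CV resamples.
library(caret)
control <- trainControl(method = "cv", number = 10)
fit_lda <- train(direction ~ ., data = combi_train, method = "lda",
                 trControl = control)
fit_rf  <- train(direction ~ ., data = combi_train, method = "rf",
                 trControl = control)
fit_glm <- train(direction ~ ., data = combi_train, method = "glm",
                 trControl = control)
results <- resamples(list(lda = fit_lda, rf = fit_rf, glm = fit_glm))
summary(results)
```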
Running logistic regression with a random partition
##
## Call:
## glm(formula = direction ~ ., family = binomial(logit), data = combi_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1880 -0.7887 0.2982 0.7460 2.2477
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.208811 0.227821 0.917 0.3594
## oecd_li -0.462097 0.362575 -1.274 0.2025
## abs_imports 0.343740 0.248223 1.385 0.1661
## abs_exports 0.135808 0.239707 0.567 0.5710
## gold_price_london_fixing 0.016457 0.257003 0.064 0.9489
## unemployment -0.383793 0.325225 -1.180 0.2380
## rba_cash_rate 0.398722 0.620285 0.643 0.5204
## yearly_inflation -0.346929 0.373206 -0.930 0.3526
## quarterly_inflation -0.549619 0.294286 -1.868 0.0618 .
## exchange_rate -0.173940 0.285152 -0.610 0.5419
## djia 1.767087 0.378839 4.664 3.09e-06 ***
## pe_ratio 0.119773 0.639629 0.187 0.8515
## dividend -0.676071 0.660572 -1.023 0.3061
## iron 0.039304 0.243980 0.161 0.8720
## oil 0.008865 0.262081 0.034 0.9730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 191.21 on 139 degrees of freedom
## Residual deviance: 130.72 on 125 degrees of freedom
## AIC: 160.72
##
## Number of Fisher Scoring iterations: 5
## prediction
## 0 1
## 0 9 5
## 1 5 15
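A sketch of the random partition and logistic fit summarised above. The split proportion and seed are assumptions; `combi` is the hypothetical full data frame, with the asx level and Date columns assumed to be dropped beforehand so that only the predictors enter the formula.

```r
# Random ~80/20 train/test split, then a logistic regression on all
# predictors, matching the glm summary shown above.
set.seed(42)  # assumed seed
idx         <- sample(nrow(combi), size = floor(0.8 * nrow(combi)))
combi_train <- combi[idx, ]
combi_test  <- combi[-idx, ]
fit  <- glm(direction ~ ., family = binomial(logit), data = combi_train)
pred <- ifelse(predict(fit, combi_test, type = "response") > 0.5, 1, 0)
table(combi_test$direction, prediction = pred)
```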
Running default k-fold cross-validation
## Generalized Linear Model
##
## 174 samples
## 14 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 156, 157, 156, 157, 157, 157, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7019608 0.3851978
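The 10-fold cross-validated logistic model reported above could be fitted with caret roughly as follows (`combi` is the hypothetical data frame):

```r
# 10-fold CV logistic regression; caret handles the fold splitting
# and averages accuracy and Kappa across folds.
library(caret)
fit_cv <- train(direction ~ ., data = combi, method = "glm",
                trControl = trainControl(method = "cv", number = 10))
fit_cv
```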
Running manual k-fold cross-validation, repeated 50 times
## [1] 0.7511111
## [1] 0.7511111
## [1] NaN
## [1] 0.1680873
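A hedged sketch of the manual procedure: create a fresh k-fold split 50 times, fit the logistic model on each training fold, and average the fold accuracies. `combi` is the hypothetical data frame, assumed to contain only the predictors and `direction`; the seed is an assumption.

```r
# Repeat 10-fold CV 50 times by hand and average the accuracies.
library(caret)
set.seed(1)  # assumed seed
accs <- replicate(50, {
  folds <- createFolds(combi$direction, k = 10)
  mean(sapply(folds, function(test_idx) {
    fit <- glm(direction ~ ., family = binomial,
               data = combi[-test_idx, ])
    p   <- predict(fit, combi[test_idx, ], type = "response")
    mean((p > 0.5) == (combi$direction[test_idx] == "1"))
  }))
})
mean(accs)
```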
Step 1. Use the default settings to train the model
Build the model with the default values:
The algorithm tested three values of mtry: 2, 8 and 14. The optimum value is 14, with accuracy 0.64. The next step is to search for a better mtry.
Step 2. Find a better mtry
Test the model with values of mtry from 1 to 14.
## [1] 0.75
We find that the optimum value of mtry is 10, with accuracy 0.75, so we proceed with mtry = 10.
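A sketch of this grid search with caret, passing the candidate mtry values through `tuneGrid` (names follow the earlier sketches):

```r
# Search mtry over 1..14 with 10-fold CV; caret keeps the best value.
library(caret)
fit_mtry <- train(direction ~ ., data = combi_train, method = "rf",
                  metric = "Accuracy",
                  tuneGrid = expand.grid(mtry = 1:14),
                  trControl = trainControl(method = "cv", number = 10))
max(fit_mtry$results$Accuracy)
```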
Step 3. Search the best maxnodes
##
## Call:
## summary.resamples(object = results_mtry)
##
## Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5 0.5714286 0.7142857 0.7857143 0.7500000 0.7857143 0.8571429 0
## 6 0.5714286 0.6607143 0.7500000 0.7285714 0.7857143 0.8571429 0
## 7 0.5714286 0.7142857 0.7857143 0.7357143 0.7857143 0.7857143 0
## 8 0.5714286 0.6607143 0.7857143 0.7357143 0.7857143 0.8571429 0
## 9 0.5714286 0.7142857 0.7857143 0.7428571 0.7857143 0.8571429 0
## 10 0.6428571 0.7142857 0.7500000 0.7571429 0.7857143 0.9285714 0
## 11 0.5714286 0.7321429 0.7857143 0.7428571 0.7857143 0.7857143 0
## 12 0.5714286 0.7142857 0.7857143 0.7357143 0.7857143 0.7857143 0
## 13 0.6428571 0.7321429 0.7857143 0.7571429 0.7857143 0.8571429 0
## 14 0.6428571 0.7142857 0.7500000 0.7357143 0.7857143 0.7857143 0
## 15 0.6428571 0.7142857 0.7500000 0.7500000 0.7857143 0.8571429 0
## 16 0.6428571 0.7142857 0.7500000 0.7500000 0.7857143 0.8571429 0
## 17 0.6428571 0.7142857 0.7500000 0.7428571 0.7857143 0.8571429 0
## 18 0.6428571 0.7142857 0.7500000 0.7428571 0.7857143 0.8571429 0
## 19 0.6428571 0.7142857 0.7500000 0.7428571 0.7857143 0.8571429 0
## 20 0.6428571 0.7142857 0.7500000 0.7428571 0.7857143 0.8571429 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5 0.04545455 0.3913043 0.5432624 0.4726986 0.5794743 0.7200000 0
## 6 0.08695652 0.2823985 0.4623188 0.4267390 0.5531915 0.7200000 0
## 7 0.08695652 0.3705534 0.5432624 0.4466595 0.5531915 0.5882353 0
## 8 0.08695652 0.2893154 0.5333333 0.4437087 0.5531915 0.7200000 0
## 9 0.08695652 0.3913043 0.5333333 0.4606169 0.5531915 0.7200000 0
## 10 0.25531915 0.3913043 0.4974359 0.4965566 0.5531915 0.8510638 0
## 11 0.12500000 0.4268116 0.5432624 0.4674335 0.5531915 0.5882353 0
## 12 0.12500000 0.3913043 0.5333333 0.4512448 0.5531915 0.5882353 0
## 13 0.25531915 0.4268116 0.5531915 0.4966973 0.5794743 0.6956522 0
## 14 0.25531915 0.3913043 0.4974359 0.4516070 0.5482270 0.5882353 0
## 15 0.25531915 0.3913043 0.4974359 0.4810154 0.5531915 0.7200000 0
## 16 0.25531915 0.3913043 0.4974359 0.4810154 0.5531915 0.7200000 0
## 17 0.25531915 0.3913043 0.4974359 0.4678389 0.5531915 0.6956522 0
## 18 0.25531915 0.3913043 0.4974359 0.4678389 0.5531915 0.6956522 0
## 19 0.25531915 0.3913043 0.4974359 0.4678389 0.5531915 0.6956522 0
## 20 0.25531915 0.3913043 0.4974359 0.4678389 0.5531915 0.6956522 0
Searching maxnodes from 5 to 20, the optimum value is 13.
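A sketch of the maxnodes search: fix mtry at 10, train one random forest per maxnodes value (caret forwards `maxnodes` to `randomForest`), and compare the resamples, as summarised above.

```r
# One model per maxnodes value, compared under identical resamples.
library(caret)
control <- trainControl(method = "cv", number = 10)
store_maxnode <- list()
for (maxnodes in 5:20) {
  fit <- train(direction ~ ., data = combi_train, method = "rf",
               tuneGrid = expand.grid(mtry = 10),
               maxnodes = maxnodes, trControl = control)
  store_maxnode[[as.character(maxnodes)]] <- fit
}
summary(resamples(store_maxnode))
```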
Step 4. Search the best ntrees
##
## Call:
## summary.resamples(object = results_tree)
##
## Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 250 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 300 0.6428571 0.7321429 0.7857143 0.7571429 0.7857143 0.8571429 0
## 350 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 400 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 450 0.6428571 0.7321429 0.7857143 0.7571429 0.7857143 0.8571429 0
## 500 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 550 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 600 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 800 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 1000 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
## 2000 0.6428571 0.7321429 0.7857143 0.7500000 0.7857143 0.7857143 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 250 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 300 0.2553191 0.4268116 0.5531915 0.4966973 0.5794743 0.6956522 0
## 350 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 400 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 450 0.2553191 0.4268116 0.5531915 0.4966973 0.5794743 0.6956522 0
## 500 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 550 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 600 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 800 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 1000 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
## 2000 0.2553191 0.4268116 0.5432624 0.4804654 0.5531915 0.5882353 0
The best value of ntree is 300. So finally we have:
- mtry = 10
- maxnodes = 13
- ntree = 300
Predicting with the tuned model
Now we evaluate the model:
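A sketch of fitting the final model with the tuned values and scoring it on the hold-out set (`combi_train` and `combi_test` as in the earlier partition sketch):

```r
# Final random forest with the tuned hyperparameters, then a
# confusion matrix on the test set.
library(caret)
fit_final <- train(direction ~ ., data = combi_train, method = "rf",
                   tuneGrid = expand.grid(mtry = 10),
                   maxnodes = 13, ntree = 300,
                   trControl = trainControl(method = "cv", number = 10))
pred <- predict(fit_final, combi_test)
confusionMatrix(pred, combi_test$direction)
```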
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8 9
## 1 6 11
##
## Accuracy : 0.5588
## 95% CI : (0.3789, 0.7281)
## No Information Rate : 0.5882
## P-Value [Acc > NIR] : 0.7017
##
## Kappa : 0.1176
##
## Mcnemar's Test P-Value : 0.6056
##
## Sensitivity : 0.5714
## Specificity : 0.5500
## Pos Pred Value : 0.4706
## Neg Pred Value : 0.6471
## Prevalence : 0.4118
## Detection Rate : 0.2353
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.5607
##
## 'Positive' Class : 0
##
We obtained an accuracy of 0.7222, which is higher than that of the default model.
Visualise the result